Fix commit idempotence of DynamicIcebergSink #14092
Conversation
Force-pushed from 83cc431 to 5e36867
mxm left a comment
Thanks for the PR @aiborodin! I'm going to have to digest this a bit. Seems like we are addressing two issues here:
1. Fixing a potential issue in DynamicCommitter with processing multiple WriteResults for a given table.
2. Refactoring the commit logic to produce one snapshot per spec change.

It would make things easier to review if we fixed (1) and then (2), perhaps in separate commits. I need to dig a bit deeper into the changes to understand them properly.
```java
@Override
public String toString() {
  return MoreObjects.toStringHelper(this)
      .add("dataFiles", dataFiles)
      .add("deleteFiles", deleteFiles)
      .add("referencedDataFiles", referencedDataFiles)
      .add("rewrittenDeleteFiles", rewrittenDeleteFiles)
      .toString();
}

@Override
public boolean equals(Object other) {
  if (other == null || getClass() != other.getClass()) {
    return false;
  }
  WriteResult that = (WriteResult) other;
  return Objects.deepEquals(dataFiles, that.dataFiles)
      && Objects.deepEquals(deleteFiles, that.deleteFiles)
      && Objects.deepEquals(referencedDataFiles, that.referencedDataFiles)
      && Objects.deepEquals(rewrittenDeleteFiles, that.rewrittenDeleteFiles);
}

@Override
public int hashCode() {
  return Objects.hash(
      Arrays.hashCode(dataFiles),
      Arrays.hashCode(deleteFiles),
      Arrays.hashCode(referencedDataFiles),
      Arrays.hashCode(rewrittenDeleteFiles));
}
```
Do we strictly need equals / hashCode for the implementation?
I'm asking because this is in core and potentially affects other engines.
Same here
> I'm asking because this is in core and potentially affects other engines.
You're right, it's not strictly necessary for the implementation. However, there have been a few instances in this codebase of decomposing and manually checking subsets of WriteResult or DataFile/DeleteFile due to a lack of these methods, so they simplify DynamicWriteResultAggregator tests, as well as DynamicWriteResult and WriteResult serialisation tests.
I don't see any harm in adding them. Why would other engines rely on equals / hashCode not being implemented? In this codebase, apart from Flink, WriteResult is only used in Spark, in SparkPositionDeltaWrite, as a simple data class.
It is very easy to rely on equals/hashCode inadvertently. Consider HashSet, HashMap, etc.
Equals is typically used in many places, but before this change it defaulted to instance equality, which is very different.
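A minimal sketch of how that could surface, assuming a prebuilt `DataFile` and the public `WriteResult` builder (illustrative only, not a call site from this PR):

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.io.WriteResult;

class WriteResultEqualitySketch {
  // Two logically identical WriteResults built from the same file. With the
  // default identity-based equals/hashCode the set keeps both entries; with
  // value-based equals/hashCode it deduplicates them.
  static int distinctResults(DataFile dataFile) {
    Set<WriteResult> results = new HashSet<>();
    results.add(WriteResult.builder().addDataFiles(dataFile).build());
    results.add(WriteResult.builder().addDataFiles(dataFile).build());
    return results.size(); // 2 before this change, 1 after
  }
}
```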
That's clear. Do you have any specific examples where you think this could be problematic? I can't think of any use case where someone would use BaseFile implementations as keys in a HashMap or put them in a HashSet without meaningful equals/hashCode implementations.
@pvary I am happy to extract the equals/hashCode into a separate PR and use manual decomposition of properties of WriteResult and Data/DeleteFiles in tests in this PR if you prefer. Although I still believe it should be totally fine to add these methods to simplify the code.
There is no guarantee that the WriteResult serialization will remain compatible between Iceberg releases. OTOH the manifest serialization should be backwards compatible. This is the reason why a manifest is used to store the temporary data that is part of a checkpoint. I don't think we would like to change this. Also, writing temporary files is costly; we need to check the cost of writing more manifest files if we decide to go down that way. We might be better off writing more sophisticated commit logic and generating better commit keys (maybe adding specId too).
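A hypothetical sketch of the "better commit key" idea; all names below are invented for illustration and are not taken from the committer:

```java
// Hypothetical: extend the identifiers already used to detect duplicate
// commits (job id, operator id, checkpoint id) with the partition spec id,
// so one checkpoint could safely map to one snapshot per spec.
record CommitKey(String jobId, String operatorId, long checkpointId, int specId) {

  // Rendered into a single snapshot-summary value; the format is illustrative.
  String asSummaryValue() {
    return String.join("-", jobId, operatorId, Long.toString(checkpointId), "spec" + specId);
  }
}
```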
Force-pushed from 5e36867 to dca52ce
@mxm How do you envision splitting these two points? The root problem is …
@pvary, that's a good point. I think we can achieve the same backwards compatibility guarantees without writing temporary manifest files, and instead write data and delete files directly into a checkpoint as Avro data. Let me experiment with this and try to get a version with Avro serialisation working, which would give us backwards compatibility.
There is an additional thing we should consider: what happens when the attributes of the Data/Delete files change? With the current approach (storing the manifest), we automatically pick up changes made in core. If we implement our own serializer/deserializer for them, we need to keep an eye out for those changes and implement the corresponding changes in our serializers every time.
I've left some thoughts in #14090. I also have an implementation, which I'll share; we can then discuss which approach is best to proceed with. Basically, the idea is to use as few snapshots as possible. We can combine append-only WriteResults into a single snapshot. Whenever delete files are present, we need multiple snapshots. To make this fault tolerant, we need to store an index into the list of write results and persist it as part of the snapshot summary, similarly to the Flink checkpoint id. We can then skip any previously applied WriteResults on recovery. As for multiple partition specs per snapshot, I couldn't find anything saying this is not permitted in Iceberg.
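A rough sketch of the index idea using the core snapshot-summary API (`SnapshotUpdate#set`); the property name and the recovery check are assumptions, not the committer's actual implementation:

```java
import java.util.List;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.RowDelta;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.WriteResult;

class IndexedCommitSketch {
  // Hypothetical summary key; a real committer would also record job,
  // operator and checkpoint ids before trusting this value on recovery.
  static final String WRITE_RESULT_INDEX = "flink.write-result-index";

  // Commits each WriteResult as its own snapshot and stores its position in
  // the snapshot summary, so recovery can skip already-applied results.
  static void commitAll(Table table, List<WriteResult> results) {
    for (int i = lastCommittedIndex(table) + 1; i < results.size(); i++) {
      WriteResult result = results.get(i);
      RowDelta rowDelta = table.newRowDelta().set(WRITE_RESULT_INDEX, Integer.toString(i));
      for (DataFile file : result.dataFiles()) {
        rowDelta.addRows(file);
      }
      for (DeleteFile file : result.deleteFiles()) {
        rowDelta.addDeletes(file);
      }
      rowDelta.commit();
    }
  }

  private static int lastCommittedIndex(Table table) {
    Snapshot current = table.currentSnapshot();
    String index = current == null ? null : current.summary().get(WRITE_RESULT_INDEX);
    return index == null ? -1 : Integer.parseInt(index);
  }
}
```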
@mxm Could you please clarify why we need multiple snapshots per checkpoint when delete files are present? Wouldn't Iceberg core already handle deletions in the aggregated …?
@aiborodin Regular data files and delete files can both be part of the same snapshot (in fact, they have to be for upsert semantics). However, we have to create a table snapshot before we process WriteResults with delete files. The reason is that data and delete files are not ordered, but deletes often require an order to be applied correctly. For example, consider a WriteResult … If we would combine … the semantics are totally different from when we first create a table snapshot for …
@mxm, so are you saying we shouldn't aggregate WriteResults with delete files? If so, it will create too many commits/snapshots with high writer parallelism, as each writer would emit a separate WriteResult.
Yes. That's why the code in the main branch is designed that way. IMHO we can only aggregate WriteResults without delete files (append-only), unless we change Iceberg core to enforce an order on how data / delete files are applied. I'd be curious to hear @pvary's thoughts on this.
We can only commit files for multiple checkpoints when there are only appends/data files in the checkpoint. For Iceberg, everything that is committed in a single transaction happened at the same time. So if we have equality deletes for both checkpoints, they will be applied together. Consider this scenario:
If we merge the C2 and C3 commits, then we add EQ1 and EQ2 in the same commit and they will be applied only to DF1. They will not be applied to DF2 or DF3, as those are added in the same commit, and as a result we will have duplication in our table: both R1' and R1'' will be present after C3.
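One reading of this scenario as core API calls, assuming DF1/R1' were committed earlier in C1; class, method, and variable names here are illustrative only:

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.Table;

class DeleteOrderingSketch {
  // Separate commits: EQ1/EQ2 get higher data sequence numbers than the data
  // files committed before them, so EQ1 applies to DF1 and EQ2 applies to
  // both DF1 and DF2.
  static void separateCommits(Table table, DataFile df2, DeleteFile eq1, DataFile df3, DeleteFile eq2) {
    table.newRowDelta().addRows(df2).addDeletes(eq1).commit(); // C2
    table.newRowDelta().addRows(df3).addDeletes(eq2).commit(); // C3
  }

  // Merged commit: DF2, DF3, EQ1 and EQ2 all share one sequence number, and
  // equality deletes only apply to data files with a strictly lower sequence
  // number, so the deletes no longer remove the superseded rows in DF2/DF3
  // and both R1' and R1'' survive.
  static void mergedCommit(Table table, DataFile df2, DeleteFile eq1, DataFile df3, DeleteFile eq2) {
    table.newRowDelta().addRows(df2).addDeletes(eq1).addRows(df3).addDeletes(eq2).commit();
  }
}
```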
Thank you for the detailed clarification on this, @pvary and @mxm. My change does not aggregate WriteResults across checkpoints. Each checkpoint would create a separate snapshot with its own delete and data files. The `commitDeltaTxn` loop keeps one transaction per checkpoint:

```java
private void commitDeltaTxn(
    Table table,
    String branch,
    NavigableMap<Long, Committer.CommitRequest<DynamicCommittable>> pendingRequests,
    CommitSummary summary,
    String newFlinkJobId,
    String operatorId) {
  for (Map.Entry<Long, CommitRequest<DynamicCommittable>> e : pendingRequests.entrySet()) {
    // We don't commit the merged result into a single transaction because for the sequential
    // transaction txn1 and txn2, the equality-delete files of txn2 are required to be applied
    // to data files from txn1. Committing the merged one will lead to the incorrect delete
    // semantic.
    WriteResult result = e.getValue().getCommittable().writeResult();
    // ...
```

The current change would only aggregate WriteResults across multiple parallel writers per (table, branch, checkpoint) triplet, similar to the current non-dynamic … Given the above, @mxm, do you still think we need to commit multiple WriteResults separately per (table, branch, checkpoint) triplet and implement the index-based solution to guarantee idempotency, as you mentioned here: #14090 (comment)? If so, could you please explain why this solution is necessary?
Force-pushed from dca52ce to 2ecb99c
@pvary @mxm I added the …
Hey folks! I've implemented the approach outlined in 1 and 2.
@aiborodin I just saw you also pushed an update. I'll take a look.
@pvary I wasn't suggesting that we combine WriteResults from multiple Flink checkpoints. I'm suggesting to combine the append-only WriteResults in each Flink checkpoint. Currently, every WriteResult is processed separately, which creates a lot of table snapshots.
@aiborodin Each table / branch pair requires a separate table snapshot. While we could combine multiple Flink checkpoints during recovery, I don't think there is much benefit in doing that. Apart from recovery, every checkpoint would normally be processed independently. We wouldn't gain much from optimizing the snapshots by combining commit requests from multiple checkpoints.
@mxm Just combining the append-only WriteResults in each Flink checkpoint would not resolve the issue for WriteResults with delete files (see my comment in #14182 (comment)). We should combine both appends and deletes within the same checkpoint (implemented in this change), which is valid for equality delete semantics because records with the same equality fields would always be routed to the same writer, storing all unique equality delete keys in the same delete file.
My suggestion is to combine both appends and deletes for the same checkpoint, which I implemented in this change.
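A purely illustrative sketch of that routing invariant (this is not the sink's actual keying code; names and hashing are assumptions):

```java
import java.util.Arrays;

class EqualityRoutingSketch {
  // Records that share the same equality-field values hash to the same writer
  // subtask, so all equality deletes for a given key land in one delete file.
  static int writerSubtask(Object[] equalityFieldValues, int writerParallelism) {
    return Math.floorMod(Arrays.hashCode(equalityFieldValues), writerParallelism);
  }
}
```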
I've responded in #14182 (comment). In a nutshell, my code was doing exactly what you suggest here, committing all appends / deletes from a checkpoint, but it would additionally add WriteResults from other checkpoints if those did not contain delete files. I noticed that the code was too hard to understand, so I removed this optimization.
IMHO the solution you implemented goes beyond the scope of fixing the issue. I'm open to refactoring the aggregator / committer code, but I think we should add a set of minimal changes to fix the correctness issue, alongside increased test coverage. This is what I think #14182 achieves.
@mxm I do understand the simplicity of #14182, and I am okay with merging it as a first step, although I think we should follow up and merge this PR as well to draw a clear boundary between aggregating temporary WriteResults (…
Force-pushed from 2ecb99c to 67bd9d1
DynamicWriteResultAggregator currently produces multiple committables per table/branch/checkpoint triplet because it aggregates write results by WriteTarget, which is unique per schemaId, specId, and equality fields. This violates the idempotence contract of the DynamicCommitter, which relies on one commit request per triplet to identify and skip already committed requests during recovery.

Fix this by aggregating WriteResult objects by table and branch (aka TableKey), which emits a single committable per checkpoint. This requires serialising the aggregated WriteResult and saving it in a Flink checkpoint instead of a temporary manifest file because, according to the Iceberg spec, a single manifest must contain files with only one partition spec, while we may aggregate write results for potentially multiple partition specs.

Change-Id: Id5cf832af4d6f64bd619635c937e216d84df4f5b
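A condensed sketch of that aggregation step, assuming the builder's `addAll(Iterable<WriteResult>)` helper and a hypothetical `TableKey`; this is not the PR's actual aggregator code:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.io.WriteResult;

class TableKeyAggregationSketch {
  // Hypothetical key: the commit above groups by table + branch instead of by
  // WriteTarget (schemaId / specId / equality fields).
  record TableKey(String tableName, String branch) {}

  // Merges all write results of one checkpoint into a single committable per
  // table/branch, so the committer sees exactly one request per triplet.
  static Map<TableKey, WriteResult> aggregate(Map<TableKey, Iterable<WriteResult>> perWriterResults) {
    Map<TableKey, WriteResult> merged = new HashMap<>();
    perWriterResults.forEach(
        (key, results) -> merged.put(key, WriteResult.builder().addAll(results).build()));
    return merged;
  }
}
```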
Force-pushed from 67bd9d1 to 3294661
@aiborodin: We are trying to do too many things in parallel in this PR, which makes it hard to review. Here are the separate features I see here:
Here is how I see the changes one by one:
Thank you for reporting the issue and working on the solution!
@pvary Thanks for summarising and highlighting the features. I will work on extracting them into separate PRs.
@pvary I raised #14312, which addresses your point 3. Could you please take a look? It moves commit aggregation to …
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Addresses the following issue: #14090.
Fix the issue by aggregating `WriteResult` objects by table and branch (aka `TableKey`), which emits a single committable per checkpoint. This requires serialising the aggregated `WriteResult` and saving it in a Flink checkpoint instead of a temporary manifest file because, according to the Iceberg spec, a single manifest must contain files with only one partition spec, while we may aggregate write results for potentially multiple partition specs.